Abstract

Background: Progress in genome sequencing is proceeding at an exponential pace, and several new algal 
genomes are becoming available every year. One of the challenges facing the community is the association of 
protein sequences encoded in the genomes with biological function. While most genome assembly projects 
generate annotations for predicted protein sequences, they are usually limited and integrate functional terms from 
a limited number of databases. Another challenge is the use of annotations to interpret large lists of 'interesting' 
genes generated by genome-scale datasets. Previously, these gene lists had to be analyzed across several 
independent biological databases, often on a gene-by-gene basis. In contrast, several annotation databases, such as 
DAVID, integrate data from multiple functional databases and reveal underlying biological themes of large gene 
lists. While several such databases have been constructed for animals, none is currently available for the study of 
algae. Due to renewed interest in algae as potential sources of biofuels and the emergence of multiple algal 
genome sequences, a significant need has arisen for such a database to process the growing compendiums of 
algal genomic data. 

Description: The Algal Functional Annotation Tool is a web-based comprehensive analysis suite integrating 
annotation data from several pathway, ontology, and protein family databases. The current version provides 
annotation for the model alga Chlamydomonas reinhardtii, and in the future will include additional genomes. The 
site allows users to interpret large gene lists by identifying associated functional terms, and their enrichment. 
Additionally, expression data for several experimental conditions were compiled and analyzed to provide an 
expression-based enrichment search. A tool to search for functionally-related genes based on gene expression 
across these conditions is also provided. Other features include dynamic visualization of genes on KEGG pathway 
maps and batch gene identifier conversion. 

Conclusions: The Algal Functional Annotation Tool aims to provide an integrated data-mining environment for 
algal genomics by combining data from multiple annotation databases into a centralized tool. This site is 
designed to expedite the process of functional annotation and the interpretation of gene lists, such as those 
derived from high-throughput RNA-seq experiments. The tool is publicly available at
http://pathways.mcdb.ucla.edu.
Background

Next-generation sequencers are revolutionizing our abil- 
ity to sequence the genomes of new algae efficiently and 
in a cost-effective manner. Several assembly tools have
been developed that take short read data and assemble 
it into large continuous fragments of DNA. Gene pre- 
diction tools are also available which identify coding 
structures within these fragments. The resulting transcripts
can then be analyzed to generate predicted protein
sequences. The function of these protein sequences
is subsequently determined by searching for close
homologs in protein databases and transferring the
annotation between the two proteins. While some ver- 
sions of the previously described data processing pipe- 
line have become commonplace in genome projects, the 
resulting functional annotation is typically fairly minimal 
and includes only limited biological pathway information 
and protein structure annotation. In contrast, the inte- 
gration of a variety of pathway, function and protein 
databases allows for the generation of much richer and 
more valuable annotations for each protein. 

A second challenge is the use of these protein-level 
annotations to interpret the output of genome-scale 
profiling experiments. High-throughput genomic tech- 
niques, such as RNA-seq experiments, produce mea- 
surements of large numbers of genes relevant to the 
biological processes being studied. In order to inter- 
pret the biological relevance of these gene lists, which 
commonly range in size from hundreds to thousands 
of genes, the members must be functionally classified 
into biological pathways and cellular mechanisms. 
Traditionally, the genes within these lists are exam- 
ined using independent annotation databases to assign 
functions and pathways. Several of these annotation 
databases, such as the Kyoto Encyclopedia of Genes 
and Genomes (KEGG) [1], MetaCyc [2], and Pfam [3], 
include a rich set of functional data useful for these 
purposes. 

However, presently researchers must explore these dif- 
ferent knowledge bases separately, which requires a sub- 
stantial amount of time and effort. Furthermore, 
without systematic integration of annotation data, it 
may be difficult to arrive at a cohesive biological picture. 
In addition, many of these annotation databases were 
designed to accommodate a single gene search, a meth- 
odology not optimal for functionally interpreting the 
large lists of genes derived from high-throughput geno- 
mic techniques. Thus, while modern genomic experi- 
ments generate data for many genes in parallel, their 
output must often still be analyzed on a gene-by-gene 
basis across different databases. This fragmented analysis 
approach presents a significant bottleneck in the pipe- 
line of biological discovery. 



One approach to solving this problem is integrating 
information from multiple annotation databases and 
providing access to the combined biological data from a 
single comprehensive portal that is equipped with the 
proper statistical foundations to effectively analyze large 
gene lists. For example, the DAVID database integrates 
information from several pathway, ontology, and protein 
family databases [4]. Similarly, Ingenuity Pathway Analysis
(IPA) provides an integrated knowledge base derived
from published literature for the human genome [5]. 
The integrated functional information and annotation 
terms are then assigned to lists of genes and for some 
analyses, enrichment tests are performed to determine 
which biological terms are overrepresented within the 
group of genes. By combining the information found in 
a number of knowledge bases and performing the analy- 
sis of lists of genes, these tools permit the efficient pro- 
cessing of high-throughput genomic experiments and 
thus expedite the process of biological discovery. How- 
ever, most of these integrated databases have been 
developed for the analysis of well-annotated and thor- 
oughly studied organisms, and are lacking for many 
newly genome-enabled organisms. 

One large group of organisms for which integrated
functional databases are lacking is the algae. The algae
constitute a branch in the plant kingdom, although they 
form a polyphyletic group as they do not include all the 
descendants of their last common ancestor. As many as 
10 algal genomes have been sequenced, including those 
of a red alga and several chlorophyte algae, with several 
more in the pipeline [6-11]. Algal genomic studies have 
provided insights into photosymbiosis, evolutionary rela- 
tionships between the different species of algae, as well 
as their unique properties and adaptations. Recently, 
there has been a renewed interest in the study of algal 
biochemistry and biology for their potential use in the 
development of renewable biofuels [reviewed in [12]]. 
This has promoted the study of varied biochemical pro- 
cesses in diverse algae, such as hydrogen metabolism, 
fermentation, lipid biosynthesis, photosynthesis and 
nutrient assimilation [13-20]. One of the most studied 
algae is Chlamydomonas reinhardtii. It has a sequenced 
genome that has been assembled into large scaffolds 
that are placed onto chromosomes [6]. For many years,
Chlamydomonas has served as a reference organism for 
the study of photosynthesis, photoreceptors, chloroplast 
biology and diseases involving flagellar dysfunction 
[21-25]. Its transcriptome has recently been profiled by 
RNA-seq experiments under various conditions of nutri- 
ent deprivation [[26,27], unpublished data (Castruita M., 
et al.)]. 

While Chlamydomonas has been extensively charac- 
terized experimentally, annotation of its genome is still 
approximate. Although KEGG categorizes some C. reinhardtii
gene models into biological pathways, other
databases - such as Reactome [28] - do not directly pro- 
vide information for proteins of this green alga. Compli- 
cating the analysis of Chlamydomonas genes is the fact 
that there are two assemblies of the genome in use (ver- 
sion 3 and version 4) and multiple sets of gene models 
have been developed that are catalogued under diverse 
identifiers: Joint Genome Institute (JGI) FM3.1 protein 
IDs for the version 3 assembly, and JGI version FM4 
protein IDs and Augustus version 5 IDs for the version 
4 assembly [11,29]. The differences between these 
assemblies are significant; for example, the version 3 
assembly contains 1,557 continuous segments of 
sequence while the fourth version contains 88. Although 
the version 3 assembly is superseded by version 4, users 
presently access version 3 because of the richer user- 
based functional annotations. In addition, other sets of 
gene predictions have been generated using a variety of 
additional data, including ESTs and RNA-seq data, to 
more accurately delineate start and stop positions and 
improve upon existing gene models. One such gene prediction
set is Augustus u10.2. As such, there are a variety
of gene models between different assemblies being
simultaneously used by researchers, presenting compli- 
cations in genomics studies. To facilitate the analysis of 
Chlamydomonas genome-scale data, we developed the 
Algal Functional Annotation Tool, which provides a 
comprehensive analysis suite for functionally interpret- 
ing C. reinhardtii genes across all available protein iden- 
tifiers. This web-based tool provides an integrative data- 
mining environment that assigns pathway, ontology, and 
protein family terms to proteins of C. reinhardtii and 
enables term enrichment analysis for lists of genes. 
Expression data for several experimental conditions are 
also integrated into the tool, allowing the determination 
of overrepresented differentially expressed conditions. 



Table 1 List of annotation resources integrated into the Algal Functional Annotation Tool

Resource          URL                                                Reference
KEGG              http://www.genome.jp/kegg/                         [1]
MetaCyc           http://www.metacyc.org/                            [2]
Pfam              http://pfam.sanger.ac.uk                           [3]
Reactome          http://www.reactome.org/                           [28]
Panther           http://www.pantherdb.org/pathway                   [30]
Gene Ontology     http://www.geneontology.org/                       [31]
InterPro          http://www.ebi.ac.uk/interpro                      [32]
MapMan Ontology   http://mapman.gabipd.org/                          [33]
KOG               http://www.ncbi.nlm.nih.gov/COG/grace/shokog.cgi   [35]

Primary databases used to functionally annotate gene models and integrated
into the Algal Functional Annotation Tool.



Additionally, a gene similarity search tool allows for 
genes with similar expression patterns to be identified 
based on expression levels across these conditions. 

Construction and Content 

Integration of Multiple Annotation Databases 

The Algal Functional Annotation Tool integrates anno- 
tation data from the biological knowledge bases listed in 
Table 1. Publicly available flat files containing annotation
data were downloaded and parsed for each individual
resource. Chlamydomonas reinhardtii proteins
were assigned KEGG pathway annotations by means of 
sequence similarity to proteins within the KEGG genes 
database [1]. MetaCyc [2], Reactome [28], and Panther 
[30] pathway annotations were assigned to C. reinhardtii 
proteins by sequence similarity to subsets of UniProt 
IDs annotated in each corresponding database. In all 
cases, sequence similarity was determined by BLAST. 
BLAST results were filtered to contain only best hits 
with an E-value < 1e-05.
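
The best-hit filtering step described above can be illustrated with a minimal sketch. It assumes the BLAST results are available in the standard 12-column tabular format (query ID in column 1, subject ID in column 2, E-value in column 11); the function name best_hits and the example identifiers are hypothetical and are not taken from the tool itself.

```python
import csv
import io

def best_hits(blast_tsv, max_evalue=1e-5):
    """Keep, for each query, the single hit with the lowest E-value below the cutoff.

    Assumes 12-column tabular BLAST output: column 1 is the query ID,
    column 2 the subject ID, and column 11 the E-value.
    """
    best = {}
    for row in csv.reader(io.StringIO(blast_tsv), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        query, subject, evalue = row[0], row[1], float(row[10])
        if evalue >= max_evalue:
            continue  # hit is not significant enough
        if query not in best or evalue < best[query][1]:
            best[query] = (subject, evalue)
    return best

# Hypothetical example: protA has two hits, protB has only a weak hit.
example = "\n".join([
    "protA\tsubj1\t98.0\t300\t6\t0\t1\t300\t1\t300\t4e-178\t620",
    "protA\tsubj2\t40.0\t120\t70\t2\t1\t120\t5\t124\t1e-20\t95",
    "protB\tsubj3\t30.0\t50\t35\t0\t1\t50\t1\t50\t0.5\t30",
])
print(best_hits(example))
```

In this sketch protA keeps only its strongest hit and protB is dropped entirely, because its only hit does not pass the E-value cutoff.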

Gene Ontology (GO) [31] terms were downloaded 
from the Chlamydomonas reinhardtii annotation pro- 
vided by JGI. These GO terms were associated with 
their respective ancestors in the hierarchical ontology 
structure to include broader functional terms and pro- 
vide a complete annotation set. Pfam domain annota- 
tions were assigned by direct search against protein 
domain signatures provided by Pfam. InterPro [32] and 
user-submitted manual annotations are based on those 
contained within JGI's annotation of the C. reinhardtii 
genome [11]. These methods were applied to four types 
of gene identifiers commonly used for C. reinhardtii 
proteins: JGI protein identifiers (versions 3 and 4) and 
Augustus gene models (versions 5 and 10.2). In total, 
over 12,600 unique functional annotation terms were 
assigned to 65,494 C. reinhardtii gene models spanning 
four different gene identifier types by these methods 
(Table 2). These assigned annotations may be explored 
for single genes using a built-in keyword search tool as 
well as an integrated annotation lookup tool which dis- 
plays all annotations for a particular identifier. 
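
The association of GO terms with their ancestors, mentioned earlier in this paragraph, amounts to a traversal of the ontology graph. The following is a minimal sketch under the assumption that the ontology is available as a mapping from each term to its direct parents; the function name and the toy ontology are hypothetical illustrations, not the tool's actual data structures.

```python
def expand_with_ancestors(terms, parents):
    """Return the given terms plus every ancestor reachable through `parents`.

    `parents` maps a term to a list of its direct parent terms (assumed format).
    """
    expanded = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term in expanded:
            continue  # already visited through another path
        expanded.add(term)
        stack.extend(parents.get(term, ()))
    return expanded

# Hypothetical toy ontology: child -> direct parents.
toy_parents = {
    "sulfate assimilation": ["sulfur compound metabolic process"],
    "sulfur compound metabolic process": ["metabolic process"],
    "metabolic process": ["biological_process"],
}
print(sorted(expand_with_ancestors(["sulfate assimilation"], toy_parents)))
```

A gene annotated only with the most specific term thereby also counts toward the broader terms when enrichment is tested.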

Assignment of Annotation from Arabidopsis thaliana 

To extend the terms associated with C. reinhardtii
genes, functional terms were inferred by homology to 
the annotation set of the plant Arabidopsis thaliana 
(thale cress). Identification of orthologous proteins was 
based on sequence similarity and subsequent filtering of 
the results by retaining only mutual best hits between 
the two sets of protein sequences. The corresponding 
Arabidopsis thaliana annotation was used to supple- 
ment GO terms and was similarly expanded to contain 
term ancestry. The A. thaliana annotations of the Map- 
Man Ontology [33] and MetaCyc Pathway database [2] 
Table 2 Number of gene identifiers associated with annotation databases

Identifier Type   Total Gene IDs   KEGG   Reactome   Panther   Gene Ontology   MapMan   KOG    Pfam   InterPro
JGI v3.0          14598            5348   2740       1147      6563            5214     9139   7166   7532
JGI v4.0          16706            4232   1949       1085      7568            3171     9973   7305   8151
Augustus v5.0     16888            4686   2983       1673      4334            3160     5123   8202   5202
Augustus u10.2    17302            4583   3326       1913      6956            3892     8977   8691   7464

Number of Chlamydomonas reinhardtii identifiers with at least one functional annotation for each primary database, shown per identifier type.



were also used to provide more complete annotation
coverage of the C. reinhardtii genome.
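
The mutual best hit criterion used to identify orthologs between C. reinhardtii and A. thaliana can be sketched as follows. The sketch assumes that the best hit of every protein has already been determined in each search direction and stored in two dictionaries; the function name and the identifiers are hypothetical.

```python
def mutual_best_hits(a_to_b, b_to_a):
    """Return sorted pairs (a, b) where b is the best hit of a and a is the best hit of b."""
    return sorted((a, b) for a, b in a_to_b.items() if b_to_a.get(b) == a)

# Hypothetical best hits in each search direction.
cre_to_ath = {"cre1": "ath1", "cre2": "ath1", "cre3": "ath3"}
ath_to_cre = {"ath1": "cre1", "ath3": "cre9"}
print(mutual_best_hits(cre_to_ath, ath_to_cre))
```

Only the pair that is reciprocal in both directions is retained; cre2 and cre3 are discarded because their best hits point back to other proteins.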

Functional Term Enrichment Testing 

The hypergeometric distribution is commonly used to 
determine the significance of functional term enrich- 
ment within a list of genes. In this test, the occurrence 
of a functional term within a gene list is compared to 
the background level of occurrence across all genes in 
the genome to determine the degree of enrichment. A 
p-value based on this test can be calculated from four 
parameters: (1) the number of genes within the list, (2) 
the frequency of a term within the gene list, (3) the 
total number of genes within the genome, and (4) the 
frequency of a term across all genes in the genome. 
This test effectively distinguishes truly overrepresented 
terms from those occurring at a high frequency across 
all genes in the genome and therefore within the gene 
list as well. The cumulative hypergeometric test assigns 
a p-value to each functional term associated with genes 
within a given list, and all functional terms are ranked 
by ascending p-value (i.e. by descending levels of enrichment).
Huang et al. review the use of the hypergeometric
test for functional term enrichment [34]. The
Algal Functional Annotation Tool computes hypergeo- 
metric p-values using a Perl wrapper for the GNU 
Scientific Library cumulative hypergeometric function 
written in C to provide a quick and accurate implemen- 
tation of this statistical test. 
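
For illustration, the cumulative hypergeometric p-value can be computed from the four parameters listed above. The sketch below uses only the Python standard library rather than the Perl wrapper around the GNU Scientific Library described in the text, so it should be read as an equivalent calculation under that assumption, not as the tool's implementation; the function and argument names are hypothetical.

```python
from math import comb

def hypergeom_enrichment_p(list_size, list_hits, genome_size, genome_hits):
    """Upper-tail probability P(X >= list_hits).

    X is the number of annotated genes obtained when drawing list_size genes
    without replacement from a genome of genome_size genes, of which
    genome_hits carry the term.
    """
    upper = min(list_size, genome_hits)
    total = comb(genome_size, list_size)
    tail = sum(
        comb(genome_hits, k) * comb(genome_size - genome_hits, list_size - k)
        for k in range(list_hits, upper + 1)
    )
    return tail / total

# Toy example: 10 genes in the genome, 3 annotated; a list of 2 genes, both annotated.
print(hypergeom_enrichment_p(list_size=2, list_hits=2, genome_size=10, genome_hits=3))
```

Computing this value for every term associated with the gene list and sorting the terms by ascending p-value reproduces the ranking described in the text.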

Dynamic Visualization of KEGG Pathway Maps 

Individual pathway maps from KEGG provide information
on protein localization within the cell, compartmentalization
into different cellular components, or
reactions within a larger metabolic process. Visualization
of proteins from gene lists onto pathway maps is useful 
for their interpretation. The Algal Functional Annota- 
tion Tool utilizes the publicly available KEGG applica- 
tion programming interface (API) for pathway 
highlighting. The information linking C. reinhardtii pro- 
teins to identifiers within the KEGG database is used to 
determine the subset of KEGG IDs within the supplied 
gene list associated with a particular pathway. The Algal 
Functional Annotation Tool also deduces which proteins 
within the pathway are located within the genome of C. 



reinhardtii but not found in the gene list and sends the 
corresponding identifiers to the KEGG API to be high- 
lighted in a different background color. This API inter- 
face is implemented using the SOAP architecture for 
web applications. 
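
The selection of identifiers to highlight can be sketched independently of the KEGG API itself. The example below only partitions the members of a pathway into those found in the supplied gene list and those present in the genome but absent from the list; it makes no call to KEGG. The function name, the identifiers, and the default colors are assumptions chosen for illustration.

```python
def pathway_colors(pathway_ids, list_ids, genome_ids,
                   hit_color="red", genome_color="green"):
    """Assign a highlight color to each pathway member found in the list or the genome."""
    colors = {}
    for kegg_id in pathway_ids:
        if kegg_id in list_ids:
            colors[kegg_id] = hit_color        # protein is in the user's gene list
        elif kegg_id in genome_ids:
            colors[kegg_id] = genome_color     # in the genome, but not in the list
    return colors

# Hypothetical identifiers.
pathway = ["K1", "K2", "K3"]
print(pathway_colors(pathway, list_ids={"K1"}, genome_ids={"K1", "K2"}))
```

Pathway members that are absent from the genome (K3 in this example) receive no color and would be left unhighlighted on the map.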

Integration of Expression Data 

The expression levels of C. reinhardtii genes have been 
experimentally characterized under numerous conditions 
using high-throughput methods such as RNA-seq 
[[26,27], unpublished data (Castruita M., et al.)]. These 
expression data were compiled and analyzed to deter- 
mine which genes are over- and under-expressed in 
each experimental condition. The expression data was 
preprocessed to normalize the counts for uniquely map- 
pable reads in any experiment. Genes exhibiting greater 
than a two-fold change in expression compared to aver- 
age expression across all conditions with a Poisson 
cumulative p-value of less than 0.05 were considered 
differentially expressed. Using this data, C. reinhardtii 
genes were associated with conditions in which they 
were over- and under-expressed. 
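
One way to read the criterion above is sketched below. The text does not specify how the Poisson p-value is parameterized, so this sketch assumes that the mean normalized count across all conditions serves as the Poisson rate, that over-expression is tested with the upper tail, and that under-expression is tested with the lower tail. The function names are hypothetical.

```python
from math import exp, lgamma, log

def poisson_cdf(k, lam):
    """P(X <= k) for X ~ Poisson(lam), with integer k and lam > 0."""
    total = sum(exp(i * log(lam) - lam - lgamma(i + 1)) for i in range(k + 1))
    return min(1.0, total)  # guard against rounding slightly above 1

def classify(count, mean, fold=2.0, alpha=0.05):
    """Label a gene in one condition as 'over', 'under', or 'unchanged'.

    Assumes `mean` (the average count across conditions) is the Poisson rate.
    """
    if mean <= 0:
        return "unchanged"
    if count > fold * mean and 1.0 - poisson_cdf(count - 1, mean) < alpha:
        return "over"
    if count < mean / fold and poisson_cdf(count, mean) < alpha:
        return "under"
    return "unchanged"

print(classify(100, 20.0), classify(2, 20.0), classify(3, 1.0))
```

The last call shows why both conditions are needed: a count of 3 against a mean of 1 exceeds the two-fold threshold, but at such low counts the change is not significant under the Poisson model.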

The compiled expression data was also analyzed to 
find functionally related genes based on their expression 
levels across the different experimental conditions 
[[26,27], unpublished data (Castruita M., et al.)]. Genes 
demonstrating low variance of expression across all 
samples were not considered. This analysis was per- 
formed for three representations of the expression data: 
absolute counts, log counts, and log ratios of expression. 
By this method, C. reinhardtii genes are each associated 
with 100 genes with the most similar expression pat- 
terns to determine potentially functionally related genes. 
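
The ranking of genes by expression similarity can be sketched as follows, assuming that the distance is one minus the Pearson correlation between two expression profiles and that genes with near-zero variance are removed beforehand. The variance threshold, function names, and example profiles are hypothetical.

```python
from statistics import mean, pvariance

def correlation_distance(x, y):
    """Return 1 - Pearson correlation between two equal-length profiles."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    norm = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return 1.0 - cov / norm

def most_similar(query, profiles, top_n=100, min_variance=1e-9):
    """Return up to top_n gene IDs ordered by increasing distance from the query."""
    # Drop genes whose expression barely varies across conditions.
    kept = {g: p for g, p in profiles.items() if pvariance(p) > min_variance}
    if query not in kept:
        return []
    ranked = sorted(
        (correlation_distance(kept[query], p), g)
        for g, p in kept.items() if g != query
    )
    return [g for _, g in ranked[:top_n]]

# Hypothetical expression profiles across four conditions.
profiles = {
    "g1": [1, 2, 3, 4],
    "g2": [2, 4, 6, 8],
    "g3": [4, 3, 2, 1],
    "flat": [5, 5, 5, 5],
}
print(most_similar("g1", profiles, top_n=2))
```

The same function could be applied to absolute counts, log counts, or log ratios by transforming the profiles before the call.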

Gene Identifier Conversion 

Due to the existence of several protein identifier types 
(FM3.1, FM4, Au5, Au10.2), different identifiers are
associated with an individual protein within the Chlamy- 
domonas genome. In order to extend annotations from 
one identifier type to another, matching protein identi- 
fiers are deduced by sequence similarity filtering for 
mutual best hits between identifiers using BLAST. 
Matching identifiers with 100% sequence coverage are 
kept, and the rest of the mutual best hits are filtered to 
include only those proteins with matches with at least 
75% coverage. Potential ambiguities involving proteins
similar to multiple other proteins are resolved by con- 
sidering only the reciprocal best hit from the BLAST 
query in the opposite direction. The information derived 
by this analysis is used to convert gene identifiers 
between different types, which allows the Algal Functional
Annotation Tool to work with multiple protein identifier types.
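
The coverage filter applied to the mutual best hits can be illustrated with a short sketch. The text does not state how coverage is measured, so the sketch assumes it is the aligned length divided by the length of the longer of the two proteins. The input tuple layout, function name, and identifiers are hypothetical.

```python
def conversion_table(mutual_hits, min_coverage=0.75):
    """Build an identifier conversion map from mutual best hits.

    mutual_hits: iterable of (id_a, id_b, aligned_length, length_a, length_b).
    Coverage is assumed to be aligned_length / max(length_a, length_b).
    """
    table = {}
    for id_a, id_b, aligned, len_a, len_b in mutual_hits:
        coverage = aligned / max(len_a, len_b)
        if coverage >= min_coverage:
            table[id_a] = id_b
    return table

# Hypothetical mutual best hits between two identifier types.
hits = [
    ("fm3_1", "au5_1", 300, 300, 300),  # full-length match
    ("fm3_2", "au5_2", 240, 300, 310),  # about 77% coverage, kept
    ("fm3_3", "au5_3", 100, 300, 280),  # about 33% coverage, dropped
]
print(conversion_table(hits))
```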

Web-Based Interface and Updates 

The web interface of the Algal Functional Annotation 
Tool consists of a set of portals that give access to the 
different types of analyses available. Results are shown 
within expandable/collapsible HTML tables that display 
annotation information along with the statistical results 
of the analysis. When expanded, the results table shows 
which gene identifiers contain a specific annotation along 
with further information regarding matching gene identi- 
fiers and BLAST E-values. Updates to the Algal Func- 
tional Annotation Tool are semi-automated using a set of 
Perl scripts that parse and process updated flat files from 
the various integrated annotation databases at regular 
intervals. Currently, functional data from the primary 
annotation databases is set to be updated every 4 months. 



Utility and Discussion 

Comprehensive, Integrated Data-Mining Environment 

The Algal Functional Annotation Tool is composed of 
three main components - functional term enrichment 
tests (which are separated by type), a batch gene identi- 
fier conversion tool, and a gene similarity search tool. A 
'Quick Start' analysis is provided from the front page, fea- 
turing enrichment analysis using a sample set of data- 
bases containing the richest set of annotations (Figure 1). 
From any page, the sidebar provides access to the 'Quick 
Start' function of the tool. 

Numerous other enrichment analyses - including 
enrichment using pathway, ontology, protein family, or 
differential expression data - are available within the 
Algal Functional Annotation Tool. Enrichment results 
are always sorted by hypergeometric p-value and when- 
ever possible contain links to the primary database's 
entry for that annotation or to the protein page of the 
gene identifier. The number of hits to a certain annotation
term is also displayed alongside the p-value, and
results may always be expanded to show additional 
details, such as the specific gene IDs within the list 
matching a certain annotation (Figure 2). These results 




Figure 1 Algal Functional Annotation Tool. The front page of the Algal Functional Annotation Tool. A 'Quick Start' analysis is available to test 
for enrichment using the richest annotation databases included in the tool. Other features accessible from the sidebar include more specific 
enrichment tests (based on biological pathways, ontology terms, or protein families), a gene identifier conversion tool, a manual annotation 
search tool, and an expression similarity search tool. 

Figure 2 Annotation Enrichment Results. Annotation enrichment results, sorted by ascending hypergeometric p-values, are shown in 
expandable/collapsible HTML tables such as the one shown. When expanded, the genes within the user-submitted list containing the expanded
annotation are shown alongside additional statistical information. All results are downloadable as tab-delimited text files. 



are downloadable as tab-delimited text files which may 
then be further analyzed or used in conjunction with 
other databases. 

Dynamic visualization of KEGG pathway maps may be 
accessed from the results table for KEGG pathway 
enrichment by clicking on any pathway name. The pro- 
teins in the list that are members of the particular biolo- 
gical pathway will appear in red, while those proteins 
existing in Chlamydomonas reinhardtii but not in the list
appear in green (Figure 3). Alternatively, by expanding 
the pathway results and following the link at the bottom, 
the user may select a custom color scheme for visualizing 
the proteins on pathway maps. These custom color 
schemes may be designed on a gene-by-gene basis 
(choosing colors individually for genes) or in a group-by- 
group fashion (such as choosing a color for those pro- 
teins found within the organism but not in the gene list). 

A list of genes may also be converted into a list of 
gene identifiers of another type. This feature allows easy 
transformation of gene IDs into corresponding models 
for use in other databases that may have additional 
annotation information. Additionally, the resulting list 
of gene identifiers may be used as a new starting point 
for enrichment analysis. Because of the different annota- 
tions associated with other gene identifier types (albeit 
of the same proteins), enrichment results using a 



converted set of gene IDs may yield new biological 
information. 

The gene similarity search tool, the third component 
of the Algal Functional Annotation Tool, accepts single 
genes and returns functionally related genes (based on 
gene expression across different experimental condi- 
tions) using user-specified distance metrics and thresh- 
olds. Presently, functionally related genes may be 
determined using correlation distance based on absolute 
counts, log counts, or log ratios of expression. The 
results page shows the original query gene at the top in 
gray and any resulting genes, sorted by similarity, are 
shown below the query gene (Figure 4). A colormap 
based on gene expression is generated for the different 
genes across the conditions, and this colormap may be 
changed to display absolute expression, log expression, 
or log ratios of expression. The distance between any 
gene and the original query gene is displayed by hover- 
ing the mouse over the gene identifier of interest. Quan- 
titative expression data (e.g. absolute counts) are 
provided for each experiment by hovering over the col- 
ormap. Whenever a description of a gene is available, 
this is displayed when hovering over the gene identifier 
as well. Links to external databases (e.g. JGI, KEGG) 
providing more information about the genes are pro- 
vided with the results. 

Figure 3 Dynamic Visualization of Gene Lists onto KEGG Pathway Maps. Dynamic KEGG pathway maps may be visualized to show the
different proteins within a user-submitted gene list. Shown is the 'Sulfur Metabolism' dynamic pathway with the matching proteins submitted 
highlighted in red. In this example, the submitted gene list is drawn from literature characterizing Chlamydomonas under sulfur-deprived 
conditions [26]. 



Ability to Re-Run Analysis for Subsets of Genes 

Once a gene list is supplied and enrichment results have 
been returned, a subset of genes corresponding to those 
that contain a particular annotation may be isolated and 
re-run through the tool to be analyzed as a separate, 
smaller gene list. This allows users to select a particu- 
larly interesting group of functionally related genes and 
isolate them to see if they are also enriched for other 
functional terms. This also allows the user to prune 
large gene lists into more focused lists of functionally 
similar genes and remove some of the inherent noise
associated with high-throughput experimental techni- 
ques and their resulting gene lists. This feature of the 
tool may be accessed by expanding the enrichment 
results of a particular annotation and selecting to re-run 
the analysis using only that subset of proteins. From this 
step, users may select which database types to query for 
enrichment (e.g. pathway, ontology, protein family). 

Expanded Annotation Coverage 

The methods described to compensate for the incom- 
plete annotation coverage of Chlamydomonas reinhardtii 
genes resulted in the addition of a vast number of unique 



annotations to the genome. While there is a strong over- 
lap between pre-existing annotations and those assigned 
by inference, many new terms have also been added. The 
annotations derived by orthology, however, are not 
mixed with the annotations obtained directly, in order to decrease
the possibility of false positive associations of functional 
terms that may distort the analysis, and to permit a com- 
parison with the functional terms derived directly from 
the Chlamydomonas annotation. 

Example - Sulfur-Related Genes 

Using a filtered list of C. reinhardtii genes derived from 
transcriptome sequencing of the green alga under sul- 
fur-depleted conditions [26], the Algal Functional Anno- 
tation Tool found enrichment for annotations related to 
sulfur metabolism, cysteine and methionine metabolism, 
and sulfur compound biosynthesis. For each annotation, 
the results may be expanded to reveal the genes con- 
taining that particular annotation. Furthermore, there is 
significant overlap between terms directly assigned to C. 
reinhardtii proteins and those inferred from A. thaliana 
orthology. Visualization of the sulfur metabolism KEGG 
pathway shows that a majority of the enzymes involved 

Figure 4 Expression Similarity Search Tool Results. An example of the results from the Gene Similarity Search Tool. Pairwise distances 
between resulting genes and the submitted gene are shown in the lower right corner when the mouse hovers over a gene of interest. 
Whenever applicable, a short description of the resulting gene is also shown when hovering over a gene. Expression data is shown when 
hovering over a point of the colormap. 



in this biological process is in the sample list, and the 
reactions they catalyze may be seen on the pathway 
map. The results for any enrichment analysis may be 
downloaded as a tab-delimited text file. Taking a gene 
found to be associated with the KEGG pathway 'Sulfur 
metabolism' by this enrichment analysis (JGI v. 3 ID 
206154) as a starting input into the gene similarity 
search tool, the genes corresponding to sulfate transpor- 
ter, methionine synthase reductase, and cysteine dioxy- 
genase were found within the top 15 results using the 
correlation metric between log counts. 
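
The expression-based search in this example can be sketched as a ranking of genes by a correlation-derived distance between log-transformed counts. The sketch below assumes the distance is one minus the Pearson correlation and adds a pseudocount of 1 before taking logarithms; both choices, along with the function names and toy data, are assumptions for illustration and not a description of the tool's code.

```python
from math import log2, sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # flat profile: treated as uncorrelated (assumption)
    return sxy / sqrt(sxx * syy)


def most_similar(query, counts, top=15, pseudocount=1.0):
    """Rank genes by distance (1 - correlation) to the query gene.

    `counts` maps gene ID -> list of counts, one per experimental condition.
    """
    logs = {g: [log2(c + pseudocount) for c in v] for g, v in counts.items()}
    q = logs[query]
    dists = [(1.0 - pearson(q, v), g) for g, v in logs.items() if g != query]
    return [(g, d) for d, g in sorted(dists)[:top]]


# Toy counts for four hypothetical genes across four conditions.
counts = {
    "query": [1, 3, 7, 15],
    "up": [3, 7, 15, 31],
    "down": [15, 7, 3, 1],
    "flat": [5, 5, 5, 5],
}
for gene, dist in most_similar("query", counts):
    print(f"{gene}\t{dist:.3f}")
```

With this distance, values near 0 indicate profiles that rise and fall together, while values near 2 indicate opposite behavior across conditions.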

Future Directions 

As with all tools that integrate data from multiple external 
sources, the power of analysis using the Algal Functional 
Annotation Tool is ultimately limited by the quality of the 
annotations within the primary databases. With the steady 
growth of knowledge in these annotation databases, the 
utility of the analyses provided is expected to increase as 
more biological associations are assigned to genes. 
Additionally, as Chlamydomonas reinhardtii genes continue 
to be experimentally characterized, the assignment of 
manual annotations will fill in the gaps left by automated 
annotation assignment and thus expand the annotation 
coverage throughout the genome, further improving the 
results generated by our portal. Lastly, the extensible 
nature of the Algal Functional Annotation Tool will allow us 
to add other algal organisms using the same platform, so 
that genomic data from other algal model organisms may be 
analyzed in a fashion similar to that currently available 
for Chlamydomonas reinhardtii. 

Conclusions 

The Algal Functional Annotation Tool is intended as a 
comprehensive analysis tool to elucidate biological 
meaning from gene lists derived from high-throughput 
experimental techniques. Annotation sets from a num- 
ber of biological databases have been pre-processed and 
assigned to gene identifiers of the green alga Chlamydo- 
monas reinhardtii, and this annotation data may be 
explored in multiple ways, including the use of enrich- 
ment tests designed for large gene lists. Furthermore, 
the site enables the visualization of proteins within 
pathway maps. Using several methods, such as inferring 
annotations from orthologous proteins of other organisms, 
the initially sparse annotation coverage of C. reinhardtii 
is expanded, allowing for a more effective functional term 
enrichment analysis. Other functions of the tool include a 
batch gene identifier conversion tool and a manual 
annotation search tool. Lastly, genes with similar 
expression across several conditions may be explored using 
the gene similarity search tool. 

Availability and Requirements 

Project name: Algal Functional Annotation Tool 

♦ Public web service: http://pathways.mcdb.ucla.edu; 
free, with no registration required. 

♦ Programming language: Perl/CGI 

♦ Database: MySQL 

♦ Software License: GNU General Public License 

List of Abbreviations Used 

API: Application Programming Interface; BLAST: Basic Local Alignment Search 
Tool; CGI: Common Gateway Interface; DAVID: Database for Annotation, 
Visualization, and Integrated Discovery; GO: Gene Ontology; KEGG: Kyoto 
Encyclopedia of Genes and Genomes; JGI: Joint Genome Institute; SOAP: 
Simple Object Access Protocol. 